Task introduction

Text-to-video retrieval aims to automatically locate the video segments most semantically relevant to a given natural language description from a large-scale video database.
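Retrieval quality on benchmarks like those below is commonly reported as Recall@K: the fraction of text queries whose matching video appears among the top-K ranked results. A minimal sketch, assuming a precomputed text-video similarity matrix in which row i / column i form the true matching pair (the matrix and values here are toy data, not from any real model):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of text queries whose true video (the diagonal entry)
    ranks in the top-k by similarity. sim[i, j] is the similarity
    between text query i and video j."""
    ranks = (-sim).argsort(axis=1)  # videos sorted best-first per query
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Toy 3x3 similarity matrix (rows: text queries, cols: videos).
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.4]])
print(recall_at_k(sim, 1))  # only query 0 ranks its video first -> 1/3
```

Papers on MSR-VTT typically report R@1, R@5, and R@10 over the test split.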

Evaluation Dataset

MSR-VTT

Data description:

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset of videos paired with text annotations. It consists of 10,000 video clips drawn from 20 categories, and each clip is annotated with 20 English sentences.

Dataset structure:

Amount of source data:

The dataset is split into training (6,513 videos), validation (497 videos), and test (2,990 videos) sets; each video has 20 captions.

Data detail:

KEYS     EXPLAIN
vid      video
texts    captions of the video

Sample of source dataset:

vid:
(video frame omitted)
texts:

  1. a baker is demonstrating a cooking technique
  2. a female giving a baking demonstration in her kitchen
  3. a girl explaining to prepare a dish
  4. a lady with a scarf is cooking with dough
  5. a person is preparing some food
  6. a person making pastries
  7. a woman is making a pastry
  8. a woman is rolling doe
  9. a woman is rolling dough around a stick
  10. a woman is rolling dough
  11. a woman is rolling dough
  12. a woman is wrapping dough around some food item
  13. a woman rolling up pastry while giving instructions
  14. a woman rolls dough
  15. a woman showing an easy way to make crescent rolls
  16. how to prepare food rolls
  17. the pastry should have five creases
  18. a person is preparing some food
  19. a woman is rolling dough around a stick
  20. a woman rolls dough
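Under the "vid"/"texts" keys described above, one record might be represented in memory as follows. This is a hypothetical sketch: the video identifier and the exact container format are made up here, not taken from any official loader.

```python
# Hypothetical in-memory form of one MSR-VTT record; field names follow
# the "vid"/"texts" keys documented above. The id "video1234" is invented.
sample = {
    "vid": "video1234",
    "texts": [
        "a baker is demonstrating a cooking technique",
        "a woman is rolling dough around a stick",
        # ... 20 captions in total per clip
    ],
}
print(len(sample["texts"]))  # 2 captions shown of the 20 per clip
```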

Citation information:

@inproceedings{xu2016msr-vtt,
  author    = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
  title     = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2016},
}

UCF-101

Data description:

UCF101 is a video dataset with 101 action categories, collected from YouTube by the University of Central Florida, containing 13,320 videos in total.

Dataset structure:

Amount of source data:

The dataset is split into training (9,537 videos) and test (3,783 videos) sets.

Data detail:

KEYS     EXPLAIN
vid      video
label    the label of the video

Sample of source dataset:

vid:
(video frame omitted)

label:
Playing Basketball
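Because UCF-101 carries class labels rather than free-form captions, a common way to use it for text-to-video retrieval is to turn each label into a text query via a prompt template. The template below is a hypothetical illustration, not a prescribed format:

```python
# Hypothetical prompt template turning a UCF-101 class label into a
# natural-language query for text-to-video retrieval.
def label_to_query(label):
    return f"a video of a person {label.lower()}"

print(label_to_query("Playing Basketball"))
# -> "a video of a person playing basketball"
```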

Citation information:

@article{soomro2012ucf101,
  title={UCF101: A dataset of 101 human actions classes from videos in the wild},
  author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal={arXiv preprint arXiv:1212.0402},
  year={2012}
}